Image processing

ABSTRACT

The present invention relates to a method of determining a sequence of intensity measures corresponding to each of a sequence of video frames, the method being used particularly but not exclusively for selection of videos according to user preferences and for providing highlights from a video sequence.  
     The method uses low-level video characteristics which may be related to the arousal and valence conveyed to a viewer whilst watching the video recording.

[0001] The present invention relates to a method of analysing a sequence of video frames, the method being used particularly but not exclusively for selection of video recordings according to user preferences and for providing highlights from a video sequence.

[0002] The number of digital video databases in both professional and consumer sectors is growing rapidly. These databases are characterised by a steadily increasing capacity and content variety. Since searching manually through terabytes of unorganised data is tedious and time-consuming, transferring search and retrieval tasks to automated systems becomes extremely important in order to be able to efficiently handle stored video.

[0003] Such automated systems rely upon algorithms for video content analysis, using models that relate certain signal properties of a video recording to the actual video content.

[0004] Due to the large number of possibilities of analysing a video recording, its content can be perceived in many different ways. Three different levels of video content perception are defined, corresponding to three different techniques for analysing a video recording. These levels are known as the feature level, the cognitive level and the affective level.

[0005] Video analysis algorithms generally start at the feature level. Examples of features are how much red is in the image, or whether objects are moving within a sequence of images. Specifying a search task at this level is usually the simplest option (e.g. “Find me a video clip featuring a stationary camera and a red blob moving from left to right!”).

[0006] At the cognitive level a user is searching for “facts”. These facts can be, for example, a panorama of San Francisco, an outdoor or an indoor scene, a broadcast news report on a defined topic, a movie dialogue between particular actors or the parts of a basketball game showing fast breaks, steals and scores.

[0007] Specifying a search task at the cognitive level implies that a video analysis algorithm is capable of establishing complex relations among features and recognising, for instance, real objects, persons, scenery and story contexts. Video analysis and retrieval at the cognitive level can be provided using advanced techniques in computer vision, artificial intelligence and speech recognition.

[0008] Most of the current worldwide research efforts in the field of video retrieval have so far been invested in improving analysis at the cognitive level.

[0009] Owing to the rapidly growing technological awareness of users, the availability of automated systems that can optimally prepare video data for easy access is important for the commercial success of consumer-oriented multimedia databases. A user is likely to require more and more from his electronic infrastructure at home, for example personalised video delivery. Since video storage is likely to become a buffer for hundreds of channels reaching a home, an automated system could take into account the preferences of the user and filter the data accordingly. Consequently, developing reliable algorithms for matching user preferences to a particular video recording is desirable in order to enable such personalised video delivery.

[0010] In this description we define the affective content of a video recording as the type and amount of feeling or emotion contained in a video recording which is conveyed to a user. Video analysis at the affective level could provide, for example, shots with happy people, a romantic film or the most exciting part of a video recording.

[0011] While cognitive level searching is one of the main requirements of professional applications (journalism, education, politics etc.), other users at home are likely to be interested in searching for affective content rather than for “all the clips where a red aeroplane appears”. For example, finding photographs having a particular “mood” was the most frequent request of advertising customers in a study of image retrieval made with Kodak Picture Exchange. A user may want to search for the “funniest” or “most sentimental” fragments of a video recording, as well as for the “most exciting” segments of a video recording depicting a sport event. Also, in the case of a complex and large TV broadcast such as the Olympic Games, the user is not able to watch everything, so it is desirable to be able to extract highlights. Extraction of the “most interesting” video clips and concatenation of them together in a “trailer” is a particularly challenging task in the field of video content analysis. Movie producers hope to achieve enormous financial profits by advertising their products (movies) using movie excerpts that last only for several tens of seconds but are capable of commanding the attention of a large number of potential cinemagoers. Similarly, other categories of broadcasts, especially sport events, advertise themselves among the TV viewers using the “most touching scenes in the sport arena” with the objective of selling their commercial blocks as profitably as possible. When creating the trailer, affective analysis of the video recording to be abstracted can provide the most important clues about which parts of a video recording are most suitable for being an element of it. Such a trailer can also be created remotely, directly at the user's home.

[0012] However, known algorithms do not address video analysis at the third, affective level. Assuming that a “cognitive” analysis algorithm has been used to find all video clips in a database that show San Francisco, additional manual effort is required to filter the extracted set of clips and isolate those that radiate a specific feeling (e.g. “romantic”) or those that the user simply “likes most”.

[0013] This invention seeks to address the task of video analysis at the affective level. If video can be analysed at this level then it is possible to provide improved personalisation of video delivery services, video retrieval applications at the affective level of content perception, and automation of the video summarisation and highlighting processes.

[0014] Aspects of the invention are set forth in the claims.

An embodiment of the invention will now be described with reference to the accompanying drawings, in which:

[0015] FIG. 1 is an illustration of the three dimensional valence, arousal and control space;

[0016] FIG. 2 is an illustration of the two dimensional valence and arousal space;

[0017] FIG. 3 is an illustration of the arousal and valence time curves;

[0018] FIG. 4 is an illustration of an affect curve;

[0019] FIG. 5 is a flowchart illustrating a series of method steps according to the invention;

[0020] FIG. 6a illustrates a Kaiser window of length 1500 and shape parameter 5;

[0021] FIG. 6b illustrates motion activity;

[0022] FIG. 6c illustrates filtered motion activity;

[0023] FIG. 7a illustrates a cut frequency function;

[0024] FIG. 7b illustrates the cut frequency function of FIG. 7a after filtering;

[0025] FIG. 8a illustrates pitch measured in a sequence of 5000 frames; and

[0026] FIG. 8b illustrates averaged and interpolated pitch values within sequence segments.

[0027] Firstly, in order to understand the invention, a short description relating to “affect” (used in this description to mean emotion or feeling) follows.

[0028] Affect may be defined using three basic underlying dimensions:

[0029] Valence (V)

[0030] Arousal (A)

[0031] Control (Dominance) (C)

[0032] Valence is typically characterised as a continuous range of affective responses extending from pleasant or “positive” to unpleasant or “negative”. As such, valence denotes the “sign” of emotion. The dimension of arousal is characterised by a continuous response ranging from energised, excited and alert to calm, drowsy or peaceful. We can also say that arousal stands for the “intensity” of emotion. The third dimension, control (dominance), is particularly useful in distinguishing between emotional states having similar arousal and valence (e.g. differentiating between “grief” and “rage”) and typically ranges from “no control” to “full control”. According to the model described above, the entire range of human emotions can be represented as a set of points in the three dimensional VAC space.

[0033] While it can theoretically be assumed that points corresponding to different affective (emotional) states are equally likely to be found anywhere in 3D VAC co-ordinate space, psychophysiological experiments have shown that only certain areas of this space are actually relevant. Measurements of the affective responses of a large group of subjects to calibrated audio-visual stimuli show that subjects' affective responses to these stimuli (quantified by measuring physiological functions) are related to particular affect dimensions. For example, heart rate and facial electromyogram are reliable indicators of valence, whereas skin conductance is associated with arousal. The contour of the affective responses (that is, the surface in this space representing the possible (or valid) combinations of values in the individual dimensions, as determined by psychophysiological studies) after mapping onto the three dimensional VAC space was roughly parabolic, as shown in FIG. 1. This contour is said to define the three dimensional VAC emotion space. The characteristic shape of the three dimensional VAC emotion space is logical, as there are few (if any) stimuli that would cause an emotional state characterised by, for instance, high arousal and neutral valence (“no screaming without reason!”).

[0034] As can be seen from FIG. 1, the effect of the control dimension becomes visible only at points with distinctly high absolute valence values. This effect is also quite small, mainly due to a rather narrow range of values belonging to this dimension. Consequently, it can be said that the control dimension plays only a limited role in characterising various emotional states. Probably for this reason, only a few studies concern themselves with this dimension.

[0035] Numerous studies of human emotional responses to media have shown that emotion elicited by pictures, television, radio, computers and sounds can be mapped onto an emotion space created by the arousal and valence axes. For this reason, in this invention the control dimension is ignored. Instead of the emotion space shown in FIG. 1, only the projection of that space onto the two dimensional VA co-ordinate space is used. An illustration of this space is shown in FIG. 2.

[0036] Measuring the arousal and valence values in a video recording results in the arousal and valence time curves, as illustrated in FIG. 3. If treated separately, the arousal time curve may provide information about the positions of the “most exciting” video segments. Since the amount of excitement is the sole major criterion that determines a user's interest in retrieving certain video genres (e.g. sport broadcasts), the arousal time curve can be considered a fully sufficient characterisation of the affective content in all programmes belonging to these genres. A good example of using the arousal curve for retrieval applications at the affective level is also illustrated in FIG. 3: the three segments of the arousal curve with the highest arousal values can be joined together and used to create a clip showing, for instance, all goals of a soccer match or all highlights of the last Olympic Games.

[0037] The valence time curve can play a crucial role in filtering out the “positive” and “negative” video segments. As such, it can contribute to fitting a video recording to the personal preferences of the user, but it can also be used for automatically performing “censorship” tasks, that is, extracting all “negative” segments and so preventing certain classes of database users from viewing them.

[0038] If an arousal time function is plotted against a valence time function, then an affect function is obtained, as illustrated in FIG. 4. The affect function provides a complete representation of the affective content of a video recording in the two dimensional VA emotion space. For instance, the area of the co-ordinate system in which the curve “spends” most of the time corresponds to the prevailing affective state (“mood”) of a video recording, and so can be used to characterise the entire video content as rather “pessimistic”, “optimistic”, “stationary/boring” or “dynamic/interesting”. This can be useful for automatically classifying a video recording into different genres. Further, the affect function can be used directly as a criterion for selecting video recordings according to user preferences. An affect function representing user preferences can be obtained by simply combining the affect functions of all programmes that the user has selected in a learning phase.

[0039] Selecting a video recording according to user preferences is then simply a case of matching such a combined affect function with that of a particular video recording. The affect function can be used for extracting video segments that are characterised by a certain mood. Furthermore, individual segments that are most suitable for being part of a movie trailer are those whose affect function passes through areas of “extreme emotions”, that is, through the upper left and the upper right sectors of the roughly parabolic two dimensional VA emotion space in FIG. 2.

[0040] An affect function needs to have the following properties in order to be useful in video analysis:

[0041] Comparability

[0042] Compatibility with VA emotion space

[0043] Smoothness

[0044] Continuity

[0045] Comparability ensures the suitability of an affect curve for video retrieval applications. Where there is a requirement for preference-driven video filtering, an affect curve measured for one video recording has to be comparable with an affect curve measured for any other video recording.

[0046] Compatibility with the VA emotion space secures the quality of the models used to obtain arousal and valence values in a video recording. These models can be considered useful only if the resulting affect curve covers an area the shape of which roughly corresponds to the parabolic-like contour of the VA emotion space illustrated in FIG. 2.

[0047] Smoothness and continuity are required due to inertia in human perception of a video recording and in the human transition from one affective state to another. Smoothness accounts for the degree of memory retention of preceding frames and shots: the perception of the content does not change abruptly from one video frame to another but is a function of a number of consecutive frames and shots. Continuity is based on the assumption that the affective state evoked in the user at different times in a video recording is not likely to change abruptly in time.

[0048] A description of an embodiment of the invention, which provides an affect curve for a video recording having the above properties, now follows.

[0049] Signal properties of a video recording that may be extracted, and which are often referred to as low-level features, include:

[0050] a) colour distribution within a video frame or a frame region;

[0051] b) texture features (distribution of frequency coefficients in a textured region, wavelet coefficients, textural energy, contrast, coarseness, directionality, repetitiveness, complexity, auto-correlation, co-occurrence matrix, fractal dimension, auto-regressive models, stochastic models, edge distribution, shape/contour parameters and models, spatial relationships between lines, regions, objects, directional and topological relationships);

[0052] c) motion vectors for frame regions, providing the motion intensity and motion direction; and

[0053] d) audio and speech features (pitch, frequency spectrum, zero-crossings, phonemes, sound/voice quality, inflection, rhythm, etc.).

[0054] In addition to this, the information acquired through analysis of editing effects, such as the frequency of shot changes, can be useful in detecting some aspects of video content. In the following we refer to low-level features such as those enumerated above, together with shot-boundary changes, collectively as low-level video characteristics.

[0055] A number of psychophysiological studies have been performed concerning the effect of non-content (structural) attributes of film and television messages on the affective state of the user. These attributes include in general the screen size, viewing distance and the amount of chrominance in a picture (e.g. black & white versus colour picture). One of the most extensively investigated attributes is motion. Motion in a sequence of television pictures has a significant impact on individual affective responses. An increase of motion intensity on the screen causes an increase in arousal and in the magnitude of valence. The sign of valence is, however, independent of motion: if the feeling of a test person was positive or negative while watching a still picture, this feeling will not change if motion is introduced within that picture.

[0056] Various characteristics of the audio and/or speech stream of a video programme provide valuable clues about the affective content of that programme. The pitch, loudness (signal energy) and speech rate (e.g. faster for fear or joy and slower for disgust or romance), for instance, are known to be directly related to arousal and the magnitude of valence. Also the inflection, rhythm, duration of the last syllable of a sentence and voice quality (e.g. breathy or resonant) are features that can be related to the sign of valence.

[0057] Pitch is related to the sign of valence. Pitch represents the fundamental frequency of voiced speech and is calculated by analysing a speech utterance. The fundamental frequency is the dominant frequency of the sound produced by the vocal cords. Pitch has a strong influence on how the listener perceives the speaker's intonation and stress. For example, pitch values will cover a greater frequency range for happiness than for a “neutral” mood, while in the case of sadness the frequency range will be smaller and centred on a lower frequency than usual.

[0058] Editing effects are useful to infer the values of some affective dimensions. The inventors have found the density of cuts (abrupt shot boundaries) to be a useful measure. Cuts are a popular tool for the director to either create the desired pace of action (e.g. in a movie) or to react to interesting events in live broadcasts (e.g. goals in a soccer game). The director deliberately chooses shorter shot lengths in movie segments he wishes to be perceived by the viewers as those with a high tempo of action development. By varying the cut density, a director controls the action dynamics and thus the viewer's attention. Therefore the varying cut density is related to the amount of arousal along a movie. In terms of the pace at which the video content is offered to a viewer, an increased shot-change rate has a similar effect on a viewer's arousal as an increase in overall motion activity. The relation between the cut density and arousal is even clearer when live broadcasts of sport events are considered. For example, a soccer match is broadcast most of the time using one camera that covers the entire field and follows the game in one continuous shot.

[0059] However, whenever there is a goal, the director immediately increases the density of cuts, trying to show everything that is happening on the field and among the spectators at that moment. This increase in cut density also appears whenever there is an important break (e.g. due to foul play, a free kick, etc.). Any increase in cut density during such a broadcast is a direct reaction of the director to an increase in the general arousal in the sport arena.

[0060] A method of generating an intensity measure corresponding to a measure of arousal using low-level video characteristics will now be described with reference to the flow chart shown in FIG. 5.

[0061] At step 10 a video signal is received which comprises a sequence of frames. At step 11 a motion activity value is determined for each frame of the received signal. Motion activity is defined here as the total motion in the picture, that is, as both motion of objects in a frame and camera motion. The motion activity m(k) is determined at video frame k as the average of the magnitudes of all motion vectors obtained by applying a block-based motion estimation procedure between frames k and k+1, normalised by the maximum possible length of a motion vector, that is

$$m(k) = \frac{100}{B \, \max\limits_{i=1,\ldots,B}\left(\left\| v_i(k) \right\|\right)}\left( \sum_{j=1}^{B} \left\| v_j(k) \right\| \right)\ \%\qquad(1)$$

[0062] In equation (1), B is the number of blocks within a frame and $v_i(k)$ is the motion vector obtained for block i of frame k.

[0063] The motion vectors can be estimated using standard block-matching based techniques, described for example in “Digital Video Processing” by A. Murat Tekalp, Prentice Hall, 1995.

[0064] In this embodiment of the invention a frame k of a video recording is divided into blocks of size 32 by 32 pixels. Each block of the image is compared with a block of the same size in frame k−1, displaced by up to 9 pixels horizontally and vertically. The displaced block in frame k−1 which is most similar to the block in frame k provides the motion vector $v_i(k)$.

[0065] The maximum possible length of the motion vector is therefore $\sqrt{9^2 + 9^2}$.
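By way of illustration only, the following is a minimal Python sketch of the motion activity computation of paragraphs [0061] to [0065]. The exhaustive search and the sum-of-absolute-differences matching criterion are assumptions (the text requires only that the “most similar” displaced block be found), as are all function and variable names; the normalisation uses the maximum possible vector length of paragraph [0065].

```python
# A sketch of motion activity m(k) via block matching; not the definitive
# implementation, just one plausible reading of paragraphs [0061]-[0065].
import numpy as np

def motion_activity(prev: np.ndarray, curr: np.ndarray,
                    block: int = 32, search: int = 9) -> float:
    """Return m(k) in percent for consecutive greyscale frames prev and curr."""
    h, w = curr.shape
    v_max = np.hypot(search, search)      # maximum possible length, sqrt(9^2 + 9^2)
    magnitudes = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            ref = curr[y:y + block, x:x + block].astype(np.int64)
            best_sad, best_vec = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    cand = prev[yy:yy + block, xx:xx + block].astype(np.int64)
                    sad = int(np.abs(ref - cand).sum())  # sum of absolute differences
                    if best_sad is None or sad < best_sad:
                        best_sad, best_vec = sad, (dx, dy)
            magnitudes.append(np.hypot(*best_vec))
    # Equation (1): average magnitude, normalised to a percentage
    return 100.0 * float(np.mean(magnitudes)) / v_max
```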

[0066] However, the motion activity curve obtained by computing the values m(k) according to equation (1) in a video recording is not directly suitable for inclusion in the model for arousal. On the one hand, the value may strongly fluctuate within a shot and, on the other hand, it may fluctuate in different ranges for two consecutive shots (e.g. total motion activity within a close-up shot is much larger than that in a shot taken from a large distance). Since these sudden changes do not comply with the desired properties of smoothness and continuity, as defined in the previous section, m(k) is convolved with a Kaiser window at step 13. This time window is shown in FIG. 6a. In this embodiment of the invention it has a length of 1500 video frames (i.e., at 25 frames/second, 60 seconds) and a shape parameter of 5. In this embodiment of the invention the shape parameter affects the sidelobe attenuation of the Fourier transform of the window.

[0067] The effect of this convolution is illustrated using the motion activity curve shown in FIG. 6b. The result of the convolution is shown in FIG. 6c and is calculated as

$$M(k) = \frac{\max\limits_{k}\left( m(k) \right)}{\max\limits_{k}\left( K(l,\beta) * m(k) \right)}\left( K(l,\beta) * m(k) \right)\ \%\qquad(2)$$

[0068] where K(l,β) is the Kaiser window, l is the window length, β is the shape parameter and “*” is the convolution operator. The motion activity curve resulting from the convolution process is much more likely to depict the viewer's arousal changes.

[0069] Scaling the windowed motion activity curve, as indicated by the fraction in equation (2), serves to ensure that the function M(k) remains in the same range as the function m(k). As can be seen from equation (2), the range is adjusted such that the maximum of the function M(k) remains the same as for the function m(k). Other values of the convolved signal are then scaled correspondingly. This explains the slight differences in the ranges of the curves in FIGS. 6b and 6c, which are only segments of the entire m(k) and M(k) curves.
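A minimal sketch of the smoothing and rescaling of equation (2) might look as follows, with scipy's Kaiser window standing in for K(l,β); the `mode="same"` boundary handling is an assumption, since the text does not specify how the sequence ends are treated.

```python
# Equation (2): Kaiser-window convolution of m(k), rescaled so max M = max m.
import numpy as np
from scipy.signal.windows import kaiser

def smooth_and_rescale(m: np.ndarray, length: int = 1500, beta: float = 5.0) -> np.ndarray:
    smoothed = np.convolve(m, kaiser(length, beta), mode="same")
    return smoothed * (m.max() / smoothed.max())
```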

[0070] At step 16 an audio signal is received which corresponds to the video signal received at step 10. It will be appreciated that whilst these steps are illustrated separately in FIG. 5, in a practical system these signals will be received simultaneously. Firstly the number s of audio samples that cover the same time period as one video frame is determined. If f is the frame rate of a video recording (normally 25 to 30 frames per second) and F the audio sampling frequency (typically 44.1 kHz for CD quality), then s is obtained as

$$s = \frac{F}{f}\qquad(3)$$

[0071] The power spectrum of each consecutive segment of the audio signal containing s samples is determined at step 17. At step 18 the sound energy value $e_{high}(k)$ is then determined by summing all spectral values starting from a pre-defined cut-off frequency C. In this embodiment of the invention the cut-off frequency is set to 700 Hz.

[0072] At step 19 a Kaiser window is used to smooth the curve $e_{high}(k)$. Abrupt peaks in the function $e_{high}(k)$ would imply that the arousal component related to sound energy can change greatly from one frame to another. Since this is not the case, the signal is filtered using a Kaiser window of length 1000 with a shape parameter equal to 5. Further, the result of the convolution procedure is normalised in order to make the sound energy values independent of the average sound volume used in a video recording. This is important in order to compare the arousal functions of video recordings that are taken under different conditions. The normalisation is performed by first normalising the convolved curve by its maximum and then multiplying it by the ratio of the maximum of the $e_{high}(k)$ curve to the maximum of the curve $e_{total}(k)$ of the total energy contained in s samples. The equation for the sound energy in higher frequencies that serves as the second component of the arousal model is given as

$$E(k) = \frac{100 \, \max\limits_{k}\left( e_{high}(k) \right)}{\max\limits_{k}\left( e_{total}(k) \right) \, \max\limits_{k}\left( K(l,\beta) * e_{high}(k) \right)}\left( K(l,\beta) * e_{high}(k) \right)\ \%\qquad(4)$$
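A hedged sketch of equations (3) and (4) follows: the audio is split into s-sample segments (one per video frame), each segment's power spectrum is taken, the energy above the 700 Hz cut-off is summed, and the result is smoothed and normalised. The FFT-based spectrum and all names are assumptions.

```python
# A sketch of the sound-energy feature of equations (3) and (4).
import numpy as np
from scipy.signal.windows import kaiser

def sound_energy(audio: np.ndarray, F: int = 44100, f: int = 25,
                 cutoff: float = 700.0, length: int = 1000,
                 beta: float = 5.0) -> np.ndarray:
    s = F // f                                  # samples per video frame, eq. (3)
    n_frames = len(audio) // s
    freqs = np.fft.rfftfreq(s, d=1.0 / F)
    e_high = np.empty(n_frames)
    e_total = np.empty(n_frames)
    for k in range(n_frames):
        spectrum = np.abs(np.fft.rfft(audio[k * s:(k + 1) * s])) ** 2
        e_total[k] = spectrum.sum()
        e_high[k] = spectrum[freqs >= cutoff].sum()
    smoothed = np.convolve(e_high, kaiser(length, beta), mode="same")
    # Equation (4): normalise by own maximum and by the high/total energy ratio
    return 100.0 * (e_high.max() / e_total.max()) * smoothed / smoothed.max()
```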

[0073] At step 12 a cut density value is calculated for each frame. Similarly to the first two features, a function of the frame index k is used that shows the relationship between a viewer's arousal and the time-varying density of cuts.

[0074] There are various cut detection methods available, for example as described in Zhang, H. J., A. Kankanhalli and S. Smoliar, “Automatic partitioning of full-motion video”, Multimedia Systems 1(1): 10-28. Cuts are normally determined by measuring changes in the visual scene content in a video stream. When the variation is above a certain threshold, a cut is detected.

[0075] In this embodiment of the invention, changes in the visual scene content are measured using a block-based histogram. The frame is divided into nine blocks; for each block a histogram is calculated in the three (R, G, B) colour bands, and then the histograms for corresponding blocks in consecutive frames are compared to calculate the variation.
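A rough sketch of this block-based histogram comparison follows; the nine-block division and per-band histograms are as described, but the 16-bin histograms and the detection threshold are assumptions that would need tuning.

```python
# A sketch of the cut detector of paragraph [0075].
import numpy as np

def histogram_difference(frame_a: np.ndarray, frame_b: np.ndarray, bins: int = 16) -> float:
    """Summed absolute histogram difference over a 3x3 grid and three colour bands."""
    h, w, _ = frame_a.shape
    diff = 0.0
    for by in range(3):
        for bx in range(3):
            ys = slice(by * h // 3, (by + 1) * h // 3)
            xs = slice(bx * w // 3, (bx + 1) * w // 3)
            for band in range(3):
                ha, _ = np.histogram(frame_a[ys, xs, band], bins=bins, range=(0, 256))
                hb, _ = np.histogram(frame_b[ys, xs, band], bins=bins, range=(0, 256))
                diff += float(np.abs(ha - hb).sum())
    return diff

def detect_cuts(frames: list, threshold: float = 1e5) -> list:
    """Indices k at which a cut is declared between frames k-1 and k."""
    return [k for k in range(1, len(frames))
            if histogram_difference(frames[k - 1], frames[k]) > threshold]
```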

[0076] The function c(k) is defined as

$$c(k) = \frac{100}{c_{next}(k) - c_{previous}(k)}\ \%\qquad(5)$$

[0077] Here, $c_{previous}(k)$ and $c_{next}(k)$ are the frame numbers of the closest cuts before and after frame k respectively. FIG. 7a shows a typical function c(k), characterised by vertical edges at the places of cuts. Due to the incompatibility between the vertical edges of c(k) and the viewer's arousal, which is characterised by inertia, the function c(k) is filtered at step 14 using a suitable Kaiser window. In this case a window of length 1000 and a shape parameter equal to 10 is used. FIG. 7b shows the result of the filtering. The analytical description of the convolution result is the function C(k), the third feature of the arousal model:

$$C(k) = \frac{\max\limits_{k}\left( c(k) \right)}{\max\limits_{k}\left( K(l,\beta) * c(k) \right)}\left( K(l,\beta) * c(k) \right)\ \%\qquad(6)$$
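A sketch of equations (5) and (6) is given below; treating the first and last frames of the sequence as cut positions, so that c(k) is defined for every frame, is an assumption of this sketch.

```python
# Cut density c(k) and its filtered version C(k), per equations (5) and (6).
import numpy as np
from scipy.signal.windows import kaiser

def cut_density(cuts: list, n_frames: int) -> np.ndarray:
    """c(k) = 100 / (c_next(k) - c_previous(k)), equation (5)."""
    bounds = [0] + sorted(cuts) + [n_frames - 1]
    c = np.empty(n_frames)
    for prev_cut, next_cut in zip(bounds, bounds[1:]):
        c[prev_cut:next_cut + 1] = 100.0 / max(next_cut - prev_cut, 1)
    return c

def filtered_cut_density(c: np.ndarray, length: int = 1000, beta: float = 10.0) -> np.ndarray:
    """C(k): Kaiser-filtered c(k), rescaled to the range of c(k), equation (6)."""
    smoothed = np.convolve(c, kaiser(length, beta), mode="same")
    return smoothed * (c.max() / smoothed.max())
```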

[0078] As indicated in equation (6), the function C(k) is scaled to occupy the same range as the function c(k).

[0079] Using the three features defined in equations (2), (4) and (6), an intensity measure A(k) is calculated at step 15 as the weighted average of these components, that is

$$A(k) = \frac{\alpha_M M(k) + \alpha_E E(k) + \alpha_C C(k)}{\alpha_M + \alpha_E + \alpha_C}\ \%\qquad(7)$$

[0080] In this embodiment of the invention, values of 1, 1 and 10 are used for the weighting factors $\alpha_M$, $\alpha_E$ and $\alpha_C$ respectively. Since each of the three components contributing to the intensity function is compliant with the properties of smoothness, continuity and compatibility, the resulting intensity function complies with these properties as well.
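The weighted combination of equation (7) is straightforward; a sketch using the weights quoted above might be:

```python
# Equation (7): weighted average of the three arousal components.
import numpy as np

def arousal(M: np.ndarray, E: np.ndarray, C: np.ndarray,
            a_M: float = 1.0, a_E: float = 1.0, a_C: float = 10.0) -> np.ndarray:
    """Intensity (arousal) measure A(k) for every frame k."""
    return (a_M * M + a_E * E + a_C * C) / (a_M + a_E + a_C)
```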

[0081] It will be appreciated that other weighting factors may be used, including a weighting factor of zero, which effectively means that that feature is not taken into account when combining the intensity values at step 15.

[0082] The most exciting parts of a video recording are those which have a high intensity measure, so frames having an intensity measure greater than a predetermined threshold may be selected to form a sequence of highlights from a film. Alternatively, the most exciting frames may be selected in order to provide a sequence of predetermined length.
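Both selection strategies of paragraph [0082] can be sketched in a few lines; the function names are illustrative assumptions:

```python
# Highlight selection by fixed threshold, or by desired highlight length.
import numpy as np

def highlights_by_threshold(A: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of frames whose intensity measure exceeds a fixed threshold."""
    return np.flatnonzero(A > threshold)

def highlights_by_length(A: np.ndarray, n: int) -> np.ndarray:
    """Indices of the n most intense frames, returned in temporal order."""
    return np.sort(np.argsort(A)[-n:])
```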

[0083] In this embodiment of the invention one low-level feature, pitch, is used to provide a measure of valence in a video recording. Using this feature, both the magnitude and the sign of valence are estimated.

[0084] FIG. 8a shows the pitch signal measured for a segment of a video recording. It can be seen that pitch values are obtained only at certain points, namely those where speech or other sound is voiced. In this embodiment of the invention a pitch detector is used to provide a pitch estimate for the audio signal received at step 16. An example of a suitable pitch detector may be found in “Speech and Audio Signal Processing: Processing and Perception of Speech and Music” by Nelson Morgan and Ben Gold, John Wiley & Sons, 1999.

[0085] Pitch estimates that are within the range from 50 to 220 Hz are selected in order to remove the influence of signals other than voice which also occur in a video recording. The pitch detector may have falsely recognised these other signals (music, noise) as voiced speech. However, we assume that the incorrect pitch estimates are positioned beyond the usual pitch range for male speakers, typically from 80 to 200 Hz. This pitch range may need to be increased to include female and children's speech.

[0086] In carrying out pitch analysis, we use a time window 25 ms long (covering ~1000 samples) to compute one pitch point for this short segment; this window is then shifted by 5 ms along the time course to compute a second pitch point, and so on. In correspondence with the video frame interval of 40 ms, we then have 8 pitch points, and we take the median of these 8 pitch points to give a single pitch output for every 40 ms long video frame. Where pitch information is absent, an interpolated value is used. All pitch values in a segment h of length L (e.g. 800) video frames that are selected in the previous step are averaged, and the average is used as the pitch value $p_h$ for the entire segment h. Once these pitch values have been determined at step 21 of FIG. 5, the pitch values are shifted at step 24 to provide positive and negative values as follows.
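A hedged sketch of this pitch post-processing follows. The raw pitch detector is assumed to exist elsewhere and to emit one estimate per 5 ms hop (NaN where unvoiced); the linear interpolation over unvoiced gaps is an assumption, as the text does not name the interpolation method.

```python
# Pitch gating, per-frame median and segment averaging, paragraphs [0085]-[0086].
import numpy as np

def pitch_per_frame(pitch_points: np.ndarray, points_per_frame: int = 8,
                    lo: float = 50.0, hi: float = 220.0) -> np.ndarray:
    """Median of the gated pitch points falling in each 40 ms video frame."""
    n_frames = len(pitch_points) // points_per_frame
    per_frame = np.full(n_frames, np.nan)
    for k in range(n_frames):
        chunk = pitch_points[k * points_per_frame:(k + 1) * points_per_frame]
        voiced = chunk[(chunk >= lo) & (chunk <= hi)]   # 50-220 Hz gate
        if voiced.size:
            per_frame[k] = np.median(voiced)
    idx = np.arange(n_frames)
    known = ~np.isnan(per_frame)
    per_frame[~known] = np.interp(idx[~known], idx[known], per_frame[known])
    return per_frame

def segment_pitch(per_frame: np.ndarray, L: int = 800) -> np.ndarray:
    """Assign each frame the average pitch p_h of its L-frame segment h."""
    P = np.empty_like(per_frame)
    for start in range(0, len(per_frame), L):
        P[start:start + L] = per_frame[start:start + L].mean()
    return P
```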

[0087] Each pitch segment h with its pitch value $p_h$ now corresponds to one segment of a video sequence containing L frames. So, each frame k belonging to one video segment is assigned the pitch value $p_h$ of the corresponding pitch segment.

[0088] Performing this procedure along all video segments results in a function P(k) as shown in FIG. 8b (where the horizontal axis represents the number of video frames). This function can be represented as

$$P(k) = p_{h,k}\quad\text{with}\quad\left\{ p_{h,k} \mid p_{h,k} = p_h \wedge k \in L_h \right\}\qquad(8)$$

[0089] Here, $L_h$ is the video segment that corresponds to the pitch segment h. Valence is now defined as

$$v(k) = P(k) - N\qquad(9)$$

[0090] where N is a predetermined “neutral feeling” frequency. Since each segment is now characterised by a single (the average) pitch value, as depicted in FIG. 8b, it can be said that if this value is below the reference frequency N, the pervasive mood in the segment is presumably rather sombre, in contrast with a relaxed and happy mood when the average pitch value is above this reference frequency.

[0091] The valence function is filtered (step 22) using a suitable Kaiser window. Here we choose a window of length 2000 and shape parameter 5. The result of this convolution between the function (9) and the selected Kaiser window, denoted $\tilde{v}(k)$, provides the final valence function, calculated (in step 23, FIG. 5) as

$$V(k) = \frac{\max\limits_{k}\left( v(k) \right) - \min\limits_{k}\left( v(k) \right)}{\max\limits_{k}\left( \tilde{v}(k) - \min\limits_{k}\left( \tilde{v}(k) \right) \right)}\left( \tilde{v}(k) - \min\limits_{k}\left( \tilde{v}(k) \right) \right) + \min\limits_{k}\left( v(k) \right)\qquad(10)$$

[0092] Equation (10) describes that the normalised value V(k) ranges between $\min\limits_{k}\left( v(k) \right)$ and $\max\limits_{k}\left( v(k) \right)$.

[0093] Note that the denominator, as well as the numerator, is a constant with regard to all the frame values that k can assume.
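Equations (9) and (10) together might be sketched as follows; the window parameters follow the text, but the value used for the “neutral feeling” frequency N is an assumed placeholder, since the text leaves N as a predetermined parameter.

```python
# Valence shift, smoothing and rescaling, per equations (9) and (10).
import numpy as np
from scipy.signal.windows import kaiser

def valence(P: np.ndarray, N: float = 150.0,
            length: int = 2000, beta: float = 5.0) -> np.ndarray:
    v = P - N                                               # equation (9)
    v_t = np.convolve(v, kaiser(length, beta), mode="same")
    # Equation (10): map the smoothed curve back onto the range of v
    return ((v.max() - v.min()) / (v_t.max() - v_t.min())) * (v_t - v_t.min()) + v.min()
```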

[0094] It will be appreciated that the shifting step 24 could be done after the filtering step 22, and indeed could be done in conjunction with step 23.

[0095] The functions of equations (10) and (7) can then be plotted against one another so as to provide a VA function such as that shown in FIG. 4. Each area of the VA space corresponds to a particular “mood” of the video recording.

[0096] Naturally such a graphical representation does not lend itself to direct automatic comparison. Therefore an “affect map” approach is proposed whereby, instead of the graphical plot, a map or vector characterising the particular video recording is generated. If the VA space (that is, the area of the graph in FIG. 4) is divided into rectangular areas, then the number of video frames having a combination of A and V falling into each rectangle can be counted, giving a set of values. For example, if the space is divided into three ranges vertically and six ranges horizontally then there would be eighteen rectangles, and hence the affect map would be a 3×6 matrix (or an eighteen-element vector). A finer or coarser division may be chosen if desired, and the division may be linear or non-linear.

[0097] Assuming a linear 8×10 division, and supposing that the map is a matrix W(i,j) (all 80 elements of which are initially zero), then the procedure for each video frame, where the total number of frames is $N_F$, is as follows:

[0098] a) quantise A to an integer A′ in the range 0 to 7;

[0099] b) quantise V to an integer V′ in the range 0 to 9;

[0100] c) increment element W(A′,V′) of W.

[0101] Once this has been done for all frames, each element of W contains a frame count; each of these is then normalised by division by $N_F$, i.e.

$$W'(i,j) = W(i,j)/N_F\quad\text{for } i = 0 \ldots 7,\ j = 0 \ldots 9.$$
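The whole map construction of paragraphs [0097] to [0101] can be sketched as below; the value ranges assumed for the linear quantisation of A and V are illustrative, as the text does not state them.

```python
# Affect map: quantise each (A(k), V(k)) pair onto an 8x10 grid, then
# normalise the counts by the total number of frames N_F.
import numpy as np

def affect_map(A: np.ndarray, V: np.ndarray,
               a_range=(0.0, 100.0), v_range=(-100.0, 100.0)) -> np.ndarray:
    W = np.zeros((8, 10))
    a_idx = np.clip(((A - a_range[0]) / (a_range[1] - a_range[0]) * 8).astype(int), 0, 7)
    v_idx = np.clip(((V - v_range[0]) / (v_range[1] - v_range[0]) * 10).astype(int), 0, 9)
    for a, v in zip(a_idx, v_idx):
        W[a, v] += 1                    # step c): increment element W(A', V')
    return W / len(A)                   # normalise by the frame count N_F
```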

[0102] This describes how to construct the map for one video recording, denoted in FIG. 5 as combination step 25. In order however to make use of the map for selection or classification purposes, generic maps may be constructed. For instance, if examples are available of video recordings which have already been classified manually, then the average of the maps for these example recordings can be formed. An unknown recording can then be analysed to obtain its affect map, and this map compared with a number of generic maps: the new recording is then assigned the classification associated with the closest matching generic map.

[0103] For selection or recommendation of further recordings for a viewer whose viewing history is known, a generic “preference map” can be constructed for that person, being the average of the individual maps for the video recordings he has chosen previously. New recordings can then each be analysed and their maps compared with the preference map, and the one (or more) having the best match can be chosen.

[0104] For comparing two maps, a distance measure can easily be generated by summing the absolute differences between the individual elements of the maps, e.g.

$$D = \sum_{i=0}^{7} \sum_{j=0}^{9} \left| W(i,j) - W_g(i,j) \right|\qquad(11)$$

[0105] (here $W_g$ is the generic map); the pair having the smallest distance D represents the closest match.
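Equation (11) reduces to a one-line comparison:

```python
# Equation (11): summed absolute element differences between affect maps.
import numpy as np

def map_distance(W: np.ndarray, W_g: np.ndarray) -> float:
    return float(np.abs(W - W_g).sum())
```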

[0106] Other distance measures could be used; or, in a more sophisticated approach, a model could be trained to make the comparisons.

[0107] In this embodiment of the invention a Kaiser window has been used throughout. In other embodiments of the invention other appropriate symmetric windows, such as the Hamming, Hanning or Blackman windows, can also be used. Since convolution in the time domain is equivalent to multiplication in the frequency domain, this convolution operation is equivalent to a filtering operation which removes high frequencies.

[0108] As will be understood by those skilled in the art, the method of this invention may be implemented in software running on a conventional computer such as a personal computer. Such software can be contained on various transmission and/or storage media such as a floppy disc, CD-ROM or magnetic tape, so that the software can be loaded onto one or more general purpose computers, or could be downloaded over a computer network using a suitable transmission medium.

[0109] Finally, although reference has been made throughout this description to video recordings, it will be understood that the methods described can be applied to other media, for example a cine film, provided that it is first converted into a video signal.

1. A method of determining a sequence of intensity measures, each intensity measure corresponding to one of a sequence of video frames; the method comprising the steps of: determining a motion activity measure for each frame of said sequence to provide a sequence of motion activity measures; filtering the sequence of motion activity measures; and providing the sequence of intensity measures using said filtered sequence of motion activity measures.
 2. A method according to claim 1 further comprising determining a measure of cut frequency to provide a sequence of cut frequency measures; filtering the sequence of cut frequency measures; and combining the filtered measures to provide said sequence of intensity measures.
 3. A method according to claim 1 or claim 2 in which the video frames have an associated audio track, further comprising determining a measure of sound energy in high frequencies for the audio track to provide a sequence of sound energy measures; filtering the sequence of sound energy measures; normalising the filtered sequence of sound energy measures; and combining the normalised sequence of sound energy measures with the filtered measures to provide said sequence of intensity measures.
 4. A method of forming an affect measure for a sequence of video frames having an associated audio track, comprising: determining a sequence of intensity measures using a method according to any one of the preceding claims; determining a pitch measurement for the audio track to provide a sequence of pitch measures; filtering the sequence of pitch measures; providing a sequence of valence measures using said filtered sequence of pitch measures; and combining the sequence of valence measures with the sequence of intensity measures to provide a sequence of affect measures.
 5. A method of selecting frames from a sequence of video frames comprising determining a sequence of intensity measures corresponding to each of the sequence of video frames using the method according to any one of claims 1 to 3; and selecting frames from the sequence of video frames for which the intensity measure is greater than a predetermined threshold.
 6. A method of selecting frames from a sequence of video frames comprising determining a sequence of intensity measures corresponding to each of the sequence of video frames using the method according to any one of claims 1 to 3; and selecting frames from the sequence of video frames for which the intensity measure is greater than a threshold; in which the threshold is determined in order to select a predetermined number of frames.
 7. A method according to claim 4 in which the affect measure is generated by: defining ranges of the intensity measures and valence measures; and counting, for respective combinations of intensity range and valence range, the number of video frames whose measures fall into the respective range combination, to produce a multi-element affect measure.
 8. A method of generating an affect measure for a sequence of video frames having an accompanying sound track, comprising: (a) generating a sequence of intensity measures for the frames of the sequence, said intensity measures depending on the degree of visual activity in the sequence; (b) generating a sequence of valence measures for the frames of the sequence, said valence measures depending on the pitch of the sound track; (c) defining ranges of the intensity measures and valence measures; and (d) counting, for respective combinations of intensity range and valence range, the number of video frames whose measures fall into the respective range combination, to produce a multi-element affect measure.
 9. A method according to claim 7 or 8 in which the counts are normalised by division by the number of frames in the video sequence.
 10. A method according to claim 7, 8 or 9 including forming a generic affect measure by combining the affect measures obtained for a plurality of video frame sequences.
 11. A method according to any one of claims 7 to 10 including the step of comparing an affect measure obtained for one sequence of video frames with the affect measures obtained for other video sequences.